Language-independent text categorization by word N-gram using an automatic acquisition of words
Authors
Abstract
We previously proposed the accumulation method, a language-independent text classification method that is based on character N-grams. The accumulation method does not depend on the language structure because this method uses character N-grams to form
Similar resources
Recognition of Persian Texts Using an n-gram Language Model and Grammatical Refinement
Text recognition has been one of the growing research topics in recent years. Much of this research has focused on recognizing letters and sub-words as a basis for identifying larger text structures such as words, phrases, and sentences. This thesis presents a new method in which the recognized sub-words are combined in order to provide meaningful words and sentences in Farsi tex...
Text Categorization Using n-Gram Based Language Independent Technique
This paper presents a language- and topic-independent, byte-level n-gram technique for topic-based text categorization. The technique relies on an n-gram frequency statistics method for document representation, and a variant of the k-nearest-neighbors machine learning algorithm for the categorization process. It does not require any morphological analysis of texts, any preprocessing steps, or any prior i...
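The idea in this abstract (byte-level n-gram frequency profiles compared by a nearest-neighbor rule) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the use of cosine similarity, the plain majority vote, and the defaults `n=3` and `k=3` are all assumptions made for illustration, since the abstract only says that a variant of k nearest neighbors is used.

```python
import math
from collections import Counter


def byte_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping byte n-grams of the UTF-8 encoding of `text`."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency profiles (assumed metric)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, labeled_docs, k=3, n=3):
    """Label `text` by majority vote among its k most similar training documents.

    `labeled_docs` is a list of (document_text, label) pairs.
    """
    query = byte_ngrams(text, n)
    scored = sorted(
        ((cosine(query, byte_ngrams(doc, n)), label) for doc, label in labeled_docs),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Because the profile is built from raw bytes, no tokenizer, stemmer, or language-specific preprocessing is involved, which is the property the abstract emphasizes.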
The Use of Topic Representative Words in Text Categorization
We present a novel way to identify the representative words that are able to capture the topic of documents for use in text categorization. Our intuition is that not all word n-grams equally represent the topic of a document, and thus using all of them can potentially dilute the feature space. Hence, our aim is to investigate methods for identifying good indexing words, and empirically evaluate...
Comparing Neural Network Approach With N-Gram Approach For Text Categorization
This paper compares a neural network approach with an N-gram approach for text categorization, and demonstrates that the neural network approach performs similarly to the N-gram approach but with much less judging time. Both methods demonstrated here are aimed at language identification. The presence of particular characters, words, and the statistical information of word lengths are used as a feature vector. I...
A Study Using n-gram Features for Text Categorization
In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or...
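As a rough illustration of word n-gram features generated after stop-word removal, the following Python sketch enumerates all word sequences up to a maximum length. It is a naive enumeration, not the efficient generation algorithm the abstract refers to; the tiny `STOP_WORDS` set, the regular-expression tokenizer, and the function name `word_ngrams` are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real study would use a fuller one.
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "and", "to", "is"})


def word_ngrams(text, max_n=2, stop_words=STOP_WORDS):
    """Return all word n-grams of length 1..max_n after dropping stop words."""
    # Lowercase and keep alphanumeric runs as tokens (an assumed tokenization).
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop_words]
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features
```

For example, `word_ngrams("The price of oil is rising")` yields the unigrams `price`, `oil`, `rising` followed by the bigrams `price oil` and `oil rising`. Note that removing stop words before forming n-grams makes `price oil` adjacent even though the words were separated in the original sentence.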